Existing automated techniques for software documentation typically attempt to reason between two main sources of information: code and natural language. However, this reasoning process is often complicated by the lexical gap between more abstract natural language and more structured programming languages. One potential bridge for this gap is the Graphical User Interface (GUI), as GUIs inherently encode salient information about underlying program functionality into rich, pixel-based data representations. This paper offers one of the first comprehensive empirical investigations into the connection between GUIs and functional, natural language descriptions of software. First, we collect, analyze, and open source a large dataset of functional GUI descriptions consisting of 45,998 descriptions for 10,204 screenshots from popular Android applications. The descriptions were obtained from human labelers and underwent several quality control mechanisms. To gain insight into the representational potential of GUIs, we investigate the ability of four Neural Image Captioning models to predict natural language descriptions of varying granularity when provided a screenshot as input. We evaluate these models quantitatively, using common machine translation metrics, and qualitatively through a large-scale user study. Finally, we offer lessons learned and a discussion of the potential shown by multimodal models to enhance future techniques for automated software documentation.
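As a hedged illustration of the quantitative evaluation step, the snippet below scores model-generated GUI descriptions against human-written references with corpus-level BLEU via NLTK; the tokenized captions are invented for the example and are not drawn from the released dataset.

```python
# A minimal sketch (not the paper's evaluation code) of scoring predicted GUI
# descriptions against human references with a standard MT metric (BLEU).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each screenshot has several human-written functional descriptions (references)
# and one model-generated caption (hypothesis). Tokens below are placeholders.
references = [
    [["displays", "a", "login", "form", "for", "the", "user"],
     ["screen", "that", "lets", "the", "user", "sign", "in"]],
]
hypotheses = [["shows", "a", "login", "screen", "for", "the", "user"]]

smooth = SmoothingFunction().method1  # avoid zero scores on short captions
bleu4 = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"Corpus BLEU-4: {bleu4:.3f}")
```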
Vehicle-to-Everything (V2X) communication has been proposed as a potential solution for improving the robustness and safety of autonomous vehicles by improving coordination and removing the barrier of non-line-of-sight sensing. Cooperative Vehicle Safety (CVS) applications depend heavily on the reliability of the underlying data system, which can suffer from information loss due to inherent issues in its components, such as sensor failures or the poor performance of V2X technologies under dense communication channel load. In particular, information loss affects the target classification module and, subsequently, the performance of the safety application. To enable reliable and robust CVS systems that mitigate the effect of information loss, we propose a Context-Aware Target Classification (CA-TC) module coupled with a hybrid learning-based predictive modeling technique for CVS systems. The CA-TC consists of two modules: a Context-Aware Map (CAM) and a Hybrid Gaussian Process (HGP) prediction system. The vehicle safety applications then use the information from the CA-TC, making them more robust and reliable. The CAM leverages vehicles' path history, road geometry, tracking, and prediction, while the HGP provides accurate vehicle trajectory predictions to compensate for data loss (due to communication congestion) or inaccuracies in sensor measurements. Based on offline real-world data, we learn a finite bank of driver models that represent the joint dynamics of the vehicle and the driver's behavior. We combine offline training and online model updates with on-the-fly forecasting to account for new possible driver behaviors. Finally, our framework is validated using simulation and realistic driving scenarios, confirming its potential to enhance the robustness and reliability of CVS systems.
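As a minimal sketch of the Gaussian-process ingredient of the HGP, under assumptions about the feature and target layout that the abstract does not specify, one could regress a vehicle's future position from its recent path history and use the predictive uncertainty to bridge a short communication outage:

```python
# Illustrative only: a plain GP regression on toy path-history data, not the
# paper's hybrid driver-model bank. Timestamps, positions, and kernel choices
# are assumed values.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# Toy path history: timestamps (s) and x-positions (m) from received messages/sensors.
t_hist = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5]).reshape(-1, 1)
x_hist = np.array([0.0, 1.1, 2.3, 3.2, 4.4, 5.6])

kernel = ConstantKernel(1.0) * RBF(length_scale=0.5) + WhiteKernel(noise_level=0.05)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t_hist, x_hist)

# Predict positions during a 0.3 s gap in V2X updates, with uncertainty estimates.
t_future = np.array([0.6, 0.7, 0.8]).reshape(-1, 1)
x_pred, x_std = gp.predict(t_future, return_std=True)
print(np.round(x_pred, 2), np.round(x_std, 2))
```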
Reliable forecasting of traffic flow requires efficient modeling of traffic data. Different correlations and influences arise in a dynamic traffic network, making modeling a complicated task. Existing literature has proposed many methods to capture the complex underlying spatial-temporal relations of traffic networks. However, these methods still struggle to capture local and global dependencies of a long-range nature. Moreover, as increasingly sophisticated methods are proposed, models are becoming more memory-heavy and thus unsuitable for low-powered devices. In this paper, we address these problems by proposing a novel deep learning framework, STLGRU. Specifically, STLGRU can effectively capture both local and global spatial-temporal relations of a traffic network using memory-augmented attention and a gating mechanism. Instead of employing separate temporal and spatial components, we show that our memory module and gated unit can learn the spatial-temporal dependencies successfully, allowing for reduced memory usage with fewer parameters. We experiment extensively on several real-world traffic prediction datasets to show that our model performs better than existing methods while maintaining a lower memory footprint. Code is available at \url{https://github.com/Kishor-Bhaumik/STLGRU}.
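A rough PyTorch sketch of the mechanism described above, i.e., attention over a learnable memory bank feeding a gated recurrent update, is shown below; it is not the released STLGRU implementation, and all module names, shapes, and the example sensor count are assumptions.

```python
# Hypothetical sketch of a memory-augmented, gated recurrent cell for traffic data.
import torch
import torch.nn as nn

class MemoryGatedCell(nn.Module):
    def __init__(self, in_dim, hid_dim, mem_slots=8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(mem_slots, hid_dim))  # learnable memory bank
        self.attn = nn.MultiheadAttention(hid_dim, num_heads=1, batch_first=True)
        self.inp = nn.Linear(in_dim, hid_dim)
        self.gru = nn.GRUCell(hid_dim, hid_dim)

    def forward(self, x, h):
        # x: (batch, nodes, in_dim) traffic readings; h: (batch, nodes, hid_dim) state.
        b, n, _ = x.shape
        q = (self.inp(x) + h).reshape(b * n, 1, -1)            # queries from input + state
        mem = self.memory.unsqueeze(0).expand(b * n, -1, -1)   # shared memory for every node
        ctx, _ = self.attn(q, mem, mem)                        # attend over memory slots
        return self.gru(ctx.squeeze(1), h.reshape(b * n, -1)).reshape(b, n, -1)

cell = MemoryGatedCell(in_dim=2, hid_dim=32)
h = torch.zeros(4, 207, 32)                 # e.g., 207 sensors as in a typical traffic graph
for x_t in torch.randn(12, 4, 207, 2):      # 12 historical time steps
    h = cell(x_t, h)
```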
Sarcasm can be defined as saying or writing the opposite of what one truly means, usually with the intent to insult, irritate, or amuse. Because of the obscure nature of sarcasm in textual data, detecting it is difficult and of great interest to the sentiment analysis research community. Although research on sarcasm detection spans more than a decade, significant advances have been made recently, including the adoption of unsupervised pre-trained transformers in multimodal settings and the integration of context to identify sarcasm. In this study, we aim to provide a brief overview of recent advances and trends in computational sarcasm research for the English language. We describe relevant datasets, methodologies, trends, issues, challenges, and tasks related to sarcasm that remain unresolved. Our study presents sarcasm datasets, sarcasm features and their extraction methods, and performance analyses of various approaches in tabular form, which can help researchers in related fields understand current state-of-the-art practices in sarcasm detection.
We propose MSGazeNet, a novel neural pipeline that learns gaze representations by leveraging eye-anatomy information through a multistream framework. Our proposed solution consists of two components: first, a network for isolating anatomical eye regions, and second, a network for multistream gaze estimation. Eye-region isolation is performed with a U-Net style network, which we train using a synthetic dataset containing eye-region masks for the visible eyeball and iris regions. The synthetic dataset used in this stage is a new dataset of 60,000 eye images that we created with the eye-gaze simulator UnityEyes. The eye-region isolation network is then transferred to the real domain to generate masks for real-world images. To make this transfer successful, we use domain randomization during training, which allows the synthetic images to benefit from larger variance with the help of artifact-like augmentations. The generated eye-region masks, together with the original eye images, are then used as multistream inputs to our gaze estimation network. We evaluate our framework on three benchmark gaze estimation datasets (MPIIGaze, EYEDIAP, and UTMultiview), where we set a new state of the art on EYEDIAP and UTMultiview with performance gains of 7.57% and 1.85%, respectively, and achieve competitive performance on MPIIGaze. We also study the robustness of our method with respect to noise in the data and show that our model is less sensitive to noisy data. Finally, we perform a variety of experiments, including ablation studies, to evaluate the contribution of the different components and design choices of our solution.
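As a rough illustration of the multistream idea described above (not the released MSGazeNet code; layer sizes and input resolution are assumptions), the real eye image and the two masks produced by the eye-region isolation network could be fed through parallel convolutional streams and fused for gaze regression:

```python
# Hypothetical multistream gaze regressor: image, eyeball mask, and iris mask streams.
import torch
import torch.nn as nn

def stream(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class MultiStreamGaze(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_stream = stream(1)      # grayscale eye image
        self.eyeball_stream = stream(1)  # eyeball mask from the isolation network
        self.iris_stream = stream(1)     # iris mask from the isolation network
        self.head = nn.Linear(32 * 3, 2)  # predict gaze (pitch, yaw)

    def forward(self, img, eyeball_mask, iris_mask):
        feats = torch.cat([self.img_stream(img),
                           self.eyeball_stream(eyeball_mask),
                           self.iris_stream(iris_mask)], dim=1)
        return self.head(feats)

model = MultiStreamGaze()
gaze = model(torch.rand(8, 1, 36, 60), torch.rand(8, 1, 36, 60), torch.rand(8, 1, 36, 60))
```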
We introduce VPFusion, a unified single- and multi-view neural implicit 3D reconstruction framework. VPFusion uses a 3D feature volume to obtain high-quality reconstruction, capturing both 3D-structure-aware context and pixel-aligned image features that capture fine local detail. For multi-view fusion, existing methods use RNNs, feature pooling, or attention computed independently within each view. RNNs suffer from long-term memory loss and permutation variance, while feature pooling or independently computed attention leaves each view's representation unaware of the other views until the final pooling step. In contrast, we show improved multi-view fusion by establishing transformer-based pairwise view associations. In particular, we propose a novel interleaved structure of 3D reasoning and pairwise view association for fusing feature volumes across different views. Using this structure-aware and multi-view-aware feature volume, we show improved 3D reconstruction performance compared with existing methods. VPFusion further improves reconstruction quality by incorporating pixel-aligned local image features to capture fine detail. We validate the effectiveness of VPFusion on the ShapeNet and ModelNet datasets, where we outperform or perform on par with state-of-the-art single- and multi-view 3D shape reconstruction methods.
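The pairwise view association could, under assumptions, be sketched with a standard transformer encoder in which feature tokens from all views attend to one another before fusion; the dimensions, token counts, and final pooling below are illustrative rather than the paper's actual design.

```python
# Hypothetical cross-view association: tokens from every view attend to all others.
import torch
import torch.nn as nn

num_views, tokens_per_view, dim = 3, 4 * 4 * 4, 128   # e.g., a coarse 4^3 feature volume per view
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
view_association = nn.TransformerEncoder(layer, num_layers=2)

# Per-view feature-volume tokens for a batch of 2 objects.
view_feats = torch.randn(2, num_views, tokens_per_view, dim)
tokens = view_feats.flatten(1, 2)                       # (batch, views * tokens, dim)
fused = view_association(tokens)                        # cross-view self-attention
fused = fused.view(2, num_views, tokens_per_view, dim).mean(dim=1)  # pool across views
```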
We propose a novel multistream network for learning robust eye representations for gaze estimation. We first use a simulator to create a synthetic dataset containing eye-region masks of the detailed visible eyeball and iris. We then perform eye-region segmentation with a U-Net type model, which we later use to generate eye-region masks for real eye images. Next, we pretrain an eye-image encoder in the real domain with self-supervised contrastive learning to learn generalized eye representations. Finally, this pretrained eye encoder, together with two additional encoders for the visible eyeball region and the iris, is used in parallel in our multistream framework to extract salient features for gaze estimation from real-world images. We demonstrate the performance of our method on the eye gaze dataset in two different evaluation settings, achieving state-of-the-art results and outperforming all existing benchmarks on this dataset. We also conduct additional experiments to validate the robustness of our self-supervised network with respect to different amounts of labeled data used for training.
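For the self-supervised pretraining step, a SimCLR-style contrastive objective along the following lines could be used; this is a generic sketch under assumptions, not the authors' training code, and the embedding size and temperature are placeholders.

```python
# Generic NT-Xent (contrastive) loss: two augmented views of the same eye image
# are pulled together and pushed away from other images in the batch.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # 2N projected embeddings
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))                     # ignore self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # positive pairs
    return F.cross_entropy(sim, targets)

# z1, z2: encoder + projection-head outputs for two augmentations of the same batch.
loss = nt_xent(torch.randn(16, 64), torch.randn(16, 64))
```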
A language-agnostic approach to recognizing emotion from speech remains an incomplete and challenging task. In this paper, we use the Bangla and English languages to evaluate whether emotion in speech is language-independent. This study classifies the following emotions: happiness, anger, neutrality, sadness, disgust, and fear. We employ three sets of emotional speech, of which the first two were developed by native Bangla speakers in the Bangla and English languages. The third is the Toronto Emotional Speech Set (TESS), developed by native English speakers in Canada. We carefully selected language-independent prosodic features, employed a Support Vector Machine (SVM) model, and conducted three experiments to support our claim. In the first experiment, we measure the performance of the three speech sets individually. This is followed by a second experiment in which we record classification rates after combining the speech sets. Finally, in the third experiment, we measure recognition rates by training and testing on different speech sets. Although this study shows that speech emotion recognition (SER) is largely language-independent, there are some differences in recognizing emotional states such as disgust and fear across the two languages. Moreover, our investigation suggests that non-native speakers convey emotion through speech much as they express themselves in their native language.
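A small illustrative sketch of the third (cross-corpus) experiment is given below: an SVM with standardized prosodic features is trained on one speech set and tested on another. The feature extraction step is abstracted away, and the arrays are random placeholders rather than the actual corpora.

```python
# Hypothetical cross-corpus SER experiment with an SVM on prosodic features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Placeholder prosodic features (e.g., pitch/energy statistics) and 6 emotion labels.
X_train, y_train = rng.normal(size=(300, 12)), rng.integers(0, 6, 300)  # e.g., Bangla set
X_test, y_test = rng.normal(size=(200, 12)), rng.integers(0, 6, 200)    # e.g., TESS

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X_train, y_train)
print("Cross-corpus accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```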
Cardiovascular disease is the most common cause of death worldwide. To detect and treat heart-related diseases, continuous blood pressure (BP) monitoring, along with many other parameters, is required. Several invasive and non-invasive methods have been developed for this purpose. Most existing methods used in hospitals for continuous BP monitoring are invasive. In contrast, cuff-based BP monitoring methods, which can predict systolic blood pressure (SBP) and diastolic blood pressure (DBP), cannot be used for continuous monitoring. Several studies have attempted to predict BP from non-invasively collectible signals such as the photoplethysmogram (PPG) and the electrocardiogram (ECG), which can be used for continuous monitoring. In this study, we explore the applicability of autoencoders for predicting BP from PPG and ECG signals. Investigating the MIMIC-II dataset of 12,000 instances, we find that a very shallow one-dimensional autoencoder can extract the relevant features to predict SBP and DBP with state-of-the-art performance on this very large dataset. An independent test on a portion of the MIMIC-II dataset yields an MAE of 2.333 and 0.713 for SBP and DBP, respectively. On an external dataset of 40 subjects, the model trained on the MIMIC-II dataset yields an MAE of 2.728 and 1.166 for SBP and DBP, respectively. These results achieve Grade A under the British Hypertension Society (BHS) standard and surpass studies in the current literature.
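A bare-bones sketch of the modeling idea is shown below, assuming a shallow 1-D convolutional autoencoder whose latent code also feeds a small BP regressor; the window length, channel counts, and loss combination are assumptions and not the paper's architecture.

```python
# Hypothetical shallow 1-D autoencoder + BP regression head for PPG/ECG windows.
import torch
import torch.nn as nn
import torch.nn.functional as F

window = 1024  # samples per PPG/ECG segment (assumed)

encoder = nn.Sequential(nn.Conv1d(2, 8, 9, stride=4, padding=4), nn.ReLU())
decoder = nn.Sequential(nn.ConvTranspose1d(8, 2, 9, stride=4, padding=4, output_padding=3))
regressor = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(8, 2))  # -> (SBP, DBP)

x = torch.randn(32, 2, window)               # batch of paired PPG + ECG segments
bp_targets = torch.randn(32, 2)              # placeholder SBP/DBP labels
z = encoder(x)
recon_loss = F.mse_loss(decoder(z), x)       # autoencoder reconstruction objective
bp_loss = F.mse_loss(regressor(z), bp_targets)  # BP regression objective
total_loss = recon_loss + bp_loss
```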
The dynamic hand gesture recognition task has been studied with a variety of unimodal and multimodal approaches. Previously, researchers have explored depth and 2D-skeleton-based multimodal fusion CRNNs (Convolutional Recurrent Neural Networks), but these have limitations in achieving the expected recognition results. In this paper, we revisit this approach to gesture recognition and propose several improvements. We observe that raw depth images have low contrast in the hand region of interest (ROI). They do not highlight important fine details, such as finger orientation, overlap between the fingers and the palm, or overlap between multiple fingers. We therefore propose quantizing the depth values into several discrete regions to create higher contrast between several key parts of the hand. In addition, we propose several methods to address the high-variance problem in existing multimodal fusion CRNN architectures. We evaluate our method on two benchmarks: the DHG-14/28 dataset and the SHREC'17 Track dataset. Our method shows a significant improvement in accuracy and parameter efficiency over previous similar multimodal methods, with results comparable to the state of the art.
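The depth-quantization step could, for illustration, look like the following NumPy sketch, which maps raw depth values in the hand ROI into a few discrete bands so that boundaries between fingers and palm become high-contrast; the bin count and depth range are assumed values, not the paper's settings.

```python
# Hypothetical depth quantization for a hand ROI.
import numpy as np

def quantize_depth(depth_roi, num_bins=8):
    """Quantize a hand-ROI depth map (in mm) into discrete, high-contrast regions."""
    valid = depth_roi[depth_roi > 0]                          # ignore missing-depth pixels
    edges = np.linspace(valid.min(), valid.max(), num_bins + 1)
    bins = np.digitize(depth_roi, edges[1:-1])                # bin indices 0 .. num_bins-1
    return (bins * (255 // (num_bins - 1))).astype(np.uint8)  # spread bins over 0..255

roi = np.random.randint(400, 700, size=(120, 160)).astype(np.float32)  # fake depth ROI
quantized = quantize_depth(roi)
```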